Introduction

Since the beginning of the Landsat missions, the remote sensing community has been interested in developing universal algorithms for extracting water quality information from remotely sensed images [@Lots of old papers]. While the oceanic community has had significant success developing universal algorithms for chlorophyll, sediment, and dissolved organic carbon (DOC) [cites], there is no inland water equivalent. Much of this discrepancy stems from the greater optical complexity of inland waters, which hinders the use of a universal algorithm, but progress on inland waters is further impeded by the lack of a shared dataset of satellite overpasses matched to in situ concentration data. Here we create and share the largest such overpass dataset ever assembled. We also outline and share our approach to bringing together three free, publicly available datasets to generate a high-graded, analysis-ready dataset for the remote sensing of water quality. While a single universal algorithm may be an unattainable goal, we anticipate that this dataset will move us towards more universal approaches built on shared and equal access to overpass information.

Potential for transformative research with remote sensing of water quality

Despite this long-recognized potential, the general hydrology and limnology communities have, until recently, not integrated data from remote sensing of inland waters into their research [Topp]. Instead, these communities have focused much of their research on Eulerian sampling schemes, in which sensors or people repeatedly sample the same points in a river or lake [DoyleEnsign]. This approach has generated a wealth of information on temporal variability in inland waters, but far less on spatial variability in rivers, lakes, and estuaries. Remote estimates of water quality in these ecosystems would allow for rapid assessment of potential algal blooms, detection of high-sediment waters, and analysis of spatiotemporal variability [cites].

Historic barriers

Serious citation of Topp, maybe none of this at all?

Modern solutions

With the profusion of publicly available in situ water quality datasets and the relatively easily accessible satellite mission archive

Methods

Landsat

Satellite    Years        Available images
Landsat 5    1984-2012    192,688
Landsat 7    1999-2018    188,781
Landsat 8    2013-2018    58,585

WQP Parameters

Rivers, Lakes, and Estuaries/Deltas

Water Quality Portal

data pull and parameters therein

LAGOSNE

Describe LAGOS datasets

In Situ data unification

Joining Landsat and Water Quality Portal data

Google Earth Engine

How we selected sites (Pekel occurrence)

Diagram of joining procedures and counts of observations dropped

Data quality flagging

Not sure what to put here or if we should have this section

Results

For LAGOSNE data see here

Dataset description

Dataset generation

Full harmonized Water Quality Portal and LAGOSNE dataset

type       chl_a        doc        secchi       tss
Estuary    170,549      39,186     363,607      160,532
Lake       837,747      73,587     2,041,409    195,557
Stream     374,772      339,972    346,537      2,735,913
Total      1,383,068    452,745    2,751,553    3,092,002

Landsat-visible sites, including LAGOS

Map

Distribution of observations per site

type       chl_a      doc       secchi     tss
Estuary    28,940     5,826     47,956     26,883
Lake       128,453    9,545     350,428    26,864
Stream     22,403     16,353    35,101     54,297
Total      179,796    31,724    433,485    108,044

Observations over time

In Situ Distribution of data

Spectral library

All parameters, no type breakdown

What can we do with this data?

name          rmse    count    mdae     mape
Suwannee      1.40    34       0.999    0.268
Niagara       1.52    26       1.01     0.250
Willamette    1.58    22       1.38     0.447
St. Clair     2.25    31       1.73     0.463
Wisconsin     2.54    26       1.92     0.290
Roanoke       5.31    68       1.97     0.356
Clark Fork    5.69    32       3.15     0.626
Tennessee     6.34    108      3.75     0.503
Pee Dee       7.78    124      4.00     0.464
St. Johns     7.71    585      4.71     0.617
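
For readers who want to reproduce per-river summaries like the one above, the following sketch shows one way the `rmse`, `mdae`, and `mape` columns could be computed. The draft does not define these columns, so this assumes the standard definitions: root mean squared error, median absolute error, and mean absolute percentage error expressed as a fraction. The function name `error_metrics` and the example values are hypothetical.

```python
import math
from statistics import mean, median


def error_metrics(observed, predicted):
    """Summarize prediction error for one group (for example, one river).

    Assumes standard definitions; the manuscript does not define the columns.
    Observed values must be non-zero for the percentage error to be defined.
    """
    errors = [p - o for o, p in zip(observed, predicted)]
    return {
        "rmse": math.sqrt(mean(e * e for e in errors)),
        "count": len(errors),
        "mdae": median(abs(e) for e in errors),
        "mape": mean(abs(e) / abs(o) for e, o in zip(errors, observed)),
    }


# Illustrative values only; these are not observations from the dataset.
metrics = error_metrics(observed=[2.0, 4.0, 10.0], predicted=[3.0, 4.0, 8.0])
```

Because RMSE squares each error before averaging, a few large misses can raise `rmse` well above `mdae` for the same river, which is one reason to report both.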

Mississippi Basin example

Dominant signals

Supplementary stuff

Spectral medians captured at sample sites

Spectral variation captured by our circles

Lots of PCA analyses

Importance of components:

Component    Standard deviation    Proportion of variance    Cumulative proportion
PC1          2.1670                0.5218                    0.5218
PC2          1.5288                0.2597                    0.7815
PC3          1.0672                0.1265                    0.9080
PC4          0.67609               0.05079                   0.95880
PC5          0.52972               0.03118                   0.98998
PC6          0.24604               0.00673                   0.99671
PC7          0.16113               0.00288                   0.99959
PC8          0.05433               0.00033                   0.99992
PC9          0.02650               0.00008                   1.00000
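
The component summary above can be checked by hand: each proportion of variance is the squared standard deviation of that component divided by the sum of all squared standard deviations. The sketch below recomputes the proportions from the reported standard deviations; the variable names are illustrative. The squared deviations sum to roughly 9, which would be consistent with nine standardized input variables, although the draft does not state the inputs.

```python
from itertools import accumulate

# Standard deviations of PC1 to PC9 as reported in the component summary.
sdev = [2.1670, 1.5288, 1.0672, 0.67609, 0.52972,
        0.24604, 0.16113, 0.05433, 0.02650]

variances = [s ** 2 for s in sdev]            # variance captured by each component
total_variance = sum(variances)               # close to 9 for these values
proportion = [v / total_variance for v in variances]
cumulative = list(accumulate(proportion))     # running share of total variance

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i}: proportion={p:.4f} cumulative={c:.4f}")
```

On these values the first three components together account for about 91% of the variance, matching the cumulative proportion reported for PC3.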

cluster    chl_a    doc     tss    secchi
1          4        8.37    3      3.66
2          8.01     4.23    10     1.52
3          10.7     4.3     21     0.8
4          7.64     5.5     7      2.44

Clusters mapped onto the TSS versus DOC plot.

Breakdown of other parameters at low and high TSS
